R and Hierarchical Agglomerative Clustering

I struggled for the past few hours to draw a dendrogram for a given similarity matrix. After searching through various cluster analysis software, I finally decided to go ahead with R. Most of the cluster analysis software available on the web starts with the raw data and generates either a similarity matrix or a distance matrix, which is stored as an object in memory. It then uses that matrix to generate clusters.

Since I am using a different method for calculating the similarity, the similarity measures built into cluster analysis software were of no use to me. FYI, I am trying to discover the latent semantic structure between terms, and I am using SVD to generate the similarity between terms.

I have a similarity matrix with terms as rows and terms as columns, which looks as follows:

,J,G,A,T,I
J,1,-0.12,0.84,0,0.19
G,-0.12,1,0.43,0,0.95
A,0.84,0.43,1,0,0.69
T,0,0,0,1,0
I,0.19,0.95,0.69,0,1

R is an excellent tool that lets me input a similarity matrix and convert it into a distance matrix. The distance matrix is then used to generate clusters, which can be plotted as a dendrogram, quick and easy. The following R script achieves the objective.

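# Term-by-term similarity matrix, entered row by row
A<-matrix(c(1,-0.12,0.84,0,0.19,
            -0.12,1,0.43,0,0.95,
            0.84,0.43,1,0,0.69,
            0,0,0,1,0,
            0.19,0.95,0.69,0,1),nrow=5,ncol=5,byrow=TRUE)
dimnames(A)<-list(c("java","game","application","travel","iphone"),
                  c("java","game","application","travel","iphone"))
# Convert similarity to distance: d(i,j) = sqrt(s(i,i) + s(j,j) - 2*s(i,j))
sim2dist <- function(mx) as.dist(sqrt(outer(diag(mx), diag(mx), "+") - 2*mx))
D = sim2dist(A)
hc = hclust(D)   # complete linkage is the default method
plot(hc)         # draw the dendrogram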

The output looks as follows:

[Figure: HAC Clustering dendrogram]
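To see what sim2dist is doing, here is a small sketch of the same conversion in plain Python rather than R, using the formula from the script, d(i,j) = sqrt(s(i,i) + s(j,j) - 2*s(i,j)). The function names sim_to_dist and closest_pair are made up for illustration, and the clamp to zero is an extra safety step that the R one-liner does not have.

```python
import math

# Similarity values copied from the matrix in this post.
terms = ["java", "game", "application", "travel", "iphone"]
sim = [
    [1.00, -0.12, 0.84, 0.0, 0.19],
    [-0.12, 1.00, 0.43, 0.0, 0.95],
    [0.84, 0.43, 1.00, 0.0, 0.69],
    [0.00, 0.00, 0.00, 1.0, 0.00],
    [0.19, 0.95, 0.69, 0.0, 1.00],
]


def sim_to_dist(s):
    """Return d[i][j] = sqrt(s[i][i] + s[j][j] - 2*s[i][j]) for a square matrix."""
    n = len(s)
    # max(..., 0.0) is an added guard against tiny negative rounding errors.
    return [
        [math.sqrt(max(s[i][i] + s[j][j] - 2 * s[i][j], 0.0)) for j in range(n)]
        for i in range(n)
    ]


def closest_pair(d):
    """Return indices (i, j), i < j, of the smallest off-diagonal distance."""
    n = len(d)
    pairs = [(d[i][j], i, j) for i in range(n) for j in range(i + 1, n)]
    _, i, j = min(pairs)
    return i, j


dist = sim_to_dist(sim)
i, j = closest_pair(dist)
print(terms[i], terms[j], round(dist[i][j], 3))  # game iphone 0.316
```

The closest pair (game and iphone, similarity 0.95) is the one an agglomerative method merges first, so it should sit at the lowest join of the dendrogram.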

We can also use agnes() from the cluster package for HAC, and kmeans() for k-means clustering.

#library(cluster)  # needed only for agnes()
A<-matrix(c(1,-0.12,0.84,0,0.19,
            -0.12,1,0.43,0,0.95,
            0.84,0.43,1,0,0.69,
            0,0,0,1,0,
            0.19,0.95,0.69,0,1),nrow=5,ncol=5,byrow=TRUE)
dimnames(A)<-list(c("java","game","application","travel","iphone"),
                  c("java","game","application","travel","iphone"))
#sim2dist <- function(mx) as.dist(sqrt(outer(diag(mx), diag(mx), "+") - 2*mx))
#D = sim2dist(A)
#Using hclust for HAC
#hc = hclust(D)
#plot(hc)

#Using agnes for HAC (diss=FALSE treats the rows of A as feature vectors)
#hc <- agnes(A,diss=FALSE,metric="euclidean",stand=FALSE,method="single")
#print(hc)
#plot(hc,ask=FALSE,which.plots=NULL)

#Using kmeans for clustering: 3 centers, at most 15 iterations
km<-kmeans(A,3,15)
print(km)
plot(A, col=km$cluster)           # scatter plot of the first two columns of A
points(km$centers,col=1:3,pch=8)  # one colour per cluster centre

January 10, 2012 at 9:53 pm

Creativity in Kids

I have been quietly worried about the way my dear son behaves: he does not focus or concentrate whenever he is asked to do anything related to studies, as opposed to eating ice creams or watermelons.

Today, I came across this interesting video [1] on TED.com about how our education system kills creativity.

An interesting snippet from the talk:

Lynne had been underperforming at school, so her mother took her to the doctor and explained about her fidgeting and lack of focus. After hearing everything her mother said, the doctor told Lynne that he needed to talk to her mother privately for a moment. He turned on the radio and walked out. He then encouraged her mother to look at Lynne, who was dancing to the radio. The doctor said that she was not sick, she was a dancer, and encouraged Lynne's mother to take her to a dance school.

I am sure most doctors would have put her on medication.

So, what is she doing now? She is a ballerina, dancer, actor, theatre director, television director and choreographer noted for her popular theatre choreography associated with the iconic musicals Cats and the current longest running show in Broadway history, The Phantom of the Opera. She is a multimillionaire and has her own production company.

My take on all this: every child is born with a talent. The parents' job is to
identify it
encourage it
support it
leave the rest to God... Are you listening? I am sure you are.
Gillian Barbara Lynne [2]
1. http://www.ted.com/talks/ken_robinson_says_schools_kill_creativity.html

2. http://en.wikipedia.org/wiki/Gillian_Lynne

July 21, 2010 at 2:27 am

Punctuation

Now, this is an interesting post.
Punctuation often goes unnoticed by me, as I am not a native speaker.
Here is a sentence which goes like this:
A woman without her man is nothing.

The above sentence can mean different things depending on the punctuation; let's see:

Way 1
A woman, without her man, is nothing.

Way 2

A woman; without her, man is nothing.

Isn't it interesting?

I remember that the first research paper I wrote, way back in 2006, received comments from the reviewer that the punctuation needed review, especially the articles (a, an, the) and the commas.

At that time I realised that we non-native English speakers can speak and understand, but when it comes to writing, we are not good. Believe me, it took me a lot of time to understand this, more time to accept it, and much more to admit it.

This post is a copy of another post.

July 3, 2010 at 8:20 am

JOSEKI with SDB and MySQL

1. Install Joseki

2. Install SDB from http://sourceforge.net/projects/jena/files/SDB/
Download the zip file. In my case, I stored the zip file sdb-1.3.1.zip in /usr/local/SDB_svn

3. unzip sdb-1.3.1.zip
This will create a directory SDB-1.3.1 under /usr/local/SDB_svn
All the SDB arsenal is located inside SDB-1.3.1

4. Install MySQL
4.1 sudo apt-get install mysql-server
4.2 sudo apt-get install mysql-query-browser
4.3 mysql -u root -p
You will be prompted to enter the password.
4.5 create database rdf2wav CHARACTER SET UTF8;

5. Create a file sdb.ttl under /usr/local/SDB_svn/SDB-1.3.1
Write the following in sdb.ttl (do not include the ############## lines)
#######################################
# See Store/ for example sdb files.

@prefix sdb: <http://jena.hpl.hp.com/2007/sdb#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix ja: <http://jena.hpl.hp.com/2005/11/Assembler#> .

# MySQL - InnoDB
<#store> rdf:type sdb:Store ;
sdb:layout "layout2" ;
sdb:connection <#conn> ;
sdb:engine "InnoDB" ; # MySQL specific
.
<#conn> rdf:type sdb:SDBConnection ;
sdb:sdbType "MySQL" ; # Needed for JDBC URL
sdb:sdbHost "localhost" ; # or the IP address of the database server
sdb:sdbName "rdf2wav" ; # MySQL database name
sdb:sdbUser "root"; # MySQL user name
sdb:sdbPassword "bike"; # MySQL password
sdb:driver "com.mysql.jdbc.Driver" ;
.

########################################

6. Write the following in .bashrc of your home directory

#####################################
export SDBROOT=/usr/local/SDB_svn/SDB-1.3.1
export SDB_JDBC=/usr/local/SDB_svn/SDB-1.3.1/lib/mysql-connector-java-5.1.12-bin.jar
export PATH=$PATH:$SDBROOT
export CLASSPATH=/usr/local/SDB_svn/SDB-1.3.1/lib/sdb-1.3.1.jar
#####################################

Note that you need mysql-connector-java-5.1.12-bin.jar in order to connect SDB with the MySQL server. If you don't have it, download it and store it in the lib folder of SDBROOT.

7. Restart the terminal and execute the following command, which will create 4 tables
bin/sdbconfig --sdb=sdb.ttl --create

The four tables that were created are
Nodes, Prefixes, Quads, and Triples.

8. Format the tables
bin/sdbconfig --sdb=sdb.ttl --format

9. Run the test suite
bin/sdbtest -v --time --sdb=sdb.ttl testing/manifest-sdb.ttl
If this runs fine with no errors, then everything is correct so far.

10. Load the RDF data
bin/sdbload -v --time --sdb=sdb.ttl /path/to/your/rdffile
For example:
bin/sdbload -v --time --sdb=sdb.ttl /usr/local/hadoop/I0/University0_0.owl
or
bin/sdbload -v --time --sdb=sdb.ttl /usr/local/hadoop/I0/*

The last form is useful if you have multiple RDF files.

These commands load the RDF data into the MySQL tables.

11. Execute the query
bin/sdbquery -v --time --sdb=sdb.ttl 'select ?x ?y ?z where {?x ?y ?z} limit 10'

If you get results, then SDB is configured with MySQL to execute SPARQL queries.

12. The next step is to configure Joseki with SDB and MySQL.
Modify the data section of joseki-config-sdb.ttl and rename the file to joseki-config.ttl.
My joseki-config.ttl data section looks like this:

<#store> rdf:type sdb:Store ;
rdfs:label "SDB" ;
sdb:layout "layout2" ;
sdb:connection
[ rdf:type sdb:SDBConnection ;
sdb:sdbType "MySql" ;
sdb:sdbHost "localhost" ;
sdb:sdbName "rdf2wav" ;
sdb:sdbUser "root";
sdb:sdbPassword "bike";
]
.

13. Start the RDF server

bin/rdfserver

Run the above command from a terminal, inside the Joseki installation folder.

14. Open http://localhost:2020 in a browser to test SPARQL queries.
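Besides the browser page, the server can also be queried from a script. The following is only a sketch in Python: it assumes the SPARQL service is published at /sparql (that path is an assumption, so check the service names in your joseki-config.ttl), and it uses the standard "query" parameter of the SPARQL protocol. The helper names build_query_url and run_query are made up for this example.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the service path is /sparql; adjust it to match joseki-config.ttl.
ENDPOINT = "http://localhost:2020/sparql"


def build_query_url(query, endpoint=ENDPOINT):
    """Encode a SPARQL query as a GET URL using the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


def run_query(query, endpoint=ENDPOINT, timeout=10):
    """Send the query to the server and return the raw response body as text."""
    with urlopen(build_query_url(query, endpoint), timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only prints the URL; call run_query(...) once bin/rdfserver is running.
    print(build_query_url("select ?x ?y ?z where {?x ?y ?z} limit 10"))
```

Pasting the printed URL into the browser should give the same result as typing the query into the form, provided the service path matches your configuration.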

June 29, 2010 at 3:56 am

Configure Joseki on Ubuntu

1. Download Joseki and put it under /home/kumar/Documents/

2. Unzip the zip file; this creates a folder Joseki-3.4.1

3. In your favourite editor, open .bashrc under your home folder. In my case it is /home/kumar/.bashrc

4. At the end of .bashrc, add the following lines

export PATH=$PATH:/home/kumar/Documents/Joseki-3.4.1

export JOSEKIROOT=/home/kumar/Documents/Joseki-3.4.1

OR

export JOSEKIROOT=/home/kumar/Documents/Joseki-3.4.1

export PATH=$PATH:$JOSEKIROOT

You need to declare the JOSEKIROOT variable because Joseki looks for the JOSEKIROOT environment variable to locate its files.

You add JOSEKIROOT to the PATH environment variable so that your terminal can locate the Joseki files.

5.  chmod 777 /home/kumar/Documents/Joseki-3.4.1/bin/*
This gives all rights to all users.
#Though the Joseki documentation recommends chmod u+x bin/*, I am always liberal

6. Open the file joseki-config.ttl located under /home/kumar/Documents/Joseki-3.4.1/

Modify the ## Datasets section of joseki-config.ttl.
The Datasets section contains the location of the RDF dataset on which to run SPARQL queries.

Here you add the path to the RDF file. When you issue a SPARQL query, it will execute on the RDF files listed in joseki-config.ttl.

Under the section ## Datasets, locate the following line

ja:content [ja:externalContent <file:/home/kumar/Desktop/lubm.rdf> ] ;

and replace the path with the location of your own file, in the form <file:path to your rdf file>

If you have more than one file, you can repeat this line. This gets inconvenient if there are thousands of files. To solve this issue, use MySQL Server with Joseki. We will look into this in the next post.

7. Open a terminal, go to /home/kumar/Documents/Joseki-3.4.1
and execute the rdfserver script
bin/rdfserver

If there are no errors, you can open a browser, type the URL http://localhost:2020/ and issue queries, which will execute on the RDF files listed in joseki-config.ttl
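If bin/rdfserver does not start, the usual suspect is the environment from step 4. Here is a rough Python sketch that checks it; the function name check_joseki_env and the wording of the messages are invented for this example, and it only looks at the things this post sets up (JOSEKIROOT, the bin/rdfserver script and PATH).

```python
import os


def check_joseki_env(environ=None):
    """Return a list of problems found in the Joseki-related environment."""
    env = os.environ if environ is None else environ
    problems = []
    root = env.get("JOSEKIROOT")
    if not root:
        # Nothing else can be checked without the root folder.
        return ["JOSEKIROOT is not set"]
    if not os.path.isdir(root):
        problems.append("JOSEKIROOT does not point to a directory: " + root)
    script = os.path.join(root, "bin", "rdfserver")
    if not os.path.isfile(script):
        problems.append("missing script: " + script)
    elif not os.access(script, os.X_OK):
        problems.append("script is not executable: " + script)
    if root not in env.get("PATH", "").split(os.pathsep):
        problems.append("JOSEKIROOT is not on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_joseki_env() or ["environment looks fine"]:
        print(problem)
```

Run it in a fresh terminal, so that the changes to .bashrc have been picked up.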

June 29, 2010 at 3:40 am

Can tweeting be dangerous?

No doubt tweeting and Facebook updates are the latest buzz. Yes, people are socially active. But has anyone ever thought that it can be dangerous?

Now, one good aspect of tweeting is that celebrities and companies are using it as a communication channel to interact with people and customers. No doubt Facebook played a great role in Obama's victory [1,2]. He communicated with the youth through social networking sites like Facebook.

Now look at the downside of social communication sites. Recently, Shashi Tharoor [3] gave up his foreign ministry post because of the controversy he ran into, or say because of dirty politics. Modi [4], the IPL chairman, was also very actively tweeting updates, and this made the Income Tax department active. Shashi was also active through Facebook, sharing his thoughts with the world. As a result of all these controversies, Shashi lost his ministerial powers, and very soon Modi will not be spared either. There are rumours that he will resign after IPL 3 is over.
The article [5] by V Raghunathan published in the Times of India probes how social media led to the fall of Shashi Tharoor and Modi. Both fell from great heights, rightly said...

Once more it is proven that it is either money or women that leads to a fight. In this controversy both were involved.

[1] http://www.usnews.com/articles/opinion/2008/11/19/barack-obama-and-the-facebook-election.html

[2] http://www.facebook.com/notes/public-relations-sydney/how-social-media-won-obama-the-us-election/137322629321

[3] http://en.wikipedia.org/wiki/Shashi_Tharoor

[4] http://en.wikipedia.org/wiki/Lalit_Modi

[5] http://blogs.timesofindia.indiatimes.com/Outraged/entry/they-have-both-fallen-but

April 20, 2010 at 6:24 am

Are Americans Afraid of Developing Countries: Food For Thought

An undeniable fact: Americans think that they are the most powerful race in the world, and they are also the most addicted to the belief that other nations in the world are in a conspiracy to undervalue them.

An excerpt from the article [1] published in the Times of India, dated Jan 29th 2010:
"These nations (India, China, Germany etc) aren't playing for second place. They're putting more emphasis on math and science. They're rebuilding their infrastructure. They're making serious investments in clean energy because they want those jobs. Well, I do not accept second place for the United States of America," said Obama.

[1] http://epaper.timesofindia.com/Default/Scripting/ArticleWin.asp?From=Archive&Source=Page&Skin=TOINEW&BaseHref=CAP/2010/01/29&PageLabel=3&EntityId=Ar00301&ViewMode=HTML&GZ=T

January 29, 2010 at 6:21 am

The genius Paul Erdos

Today, while working in my office, I came across a blog which referred to a great prodigy in mathematics, Paul Erdos. Earlier I had seen some people mention on their websites that their Erdos number is 1 or 2 or 3... At that time I gave it a pass.

First of all, Paul Erdos was a Hungarian mathematician who published work on number theory and other stuff which I am never able to understand :). I expect my son to understand all this. Ha Ha Ha...

So this guy never married, had no kids, and was a complete workaholic, a nomad and a prodigy. One thing is very unique: he has the highest number of co-authors in his published work. That's why one of his friends invented the concept of the Erdos number.

So, if I have an Erdos number of 1, that means I have published a paper with him, that is, I have been one of his co-authors. And if I have an Erdos number of 2, that means I have published a paper with someone who is a co-author of Erdos. Erdos himself has an Erdos number of 0.
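Computed this way, the Erdos number is just the shortest path length in the co-authorship graph, so a breadth-first search finds it. Below is a minimal sketch in Python; the author lists are invented purely for illustration, and erdos_numbers is a name I made up, not something from a library.

```python
from collections import deque


def erdos_numbers(papers, source="Erdos"):
    """Breadth-first search over the co-authorship graph built from author lists."""
    coauthors = {}
    for authors in papers:
        for a in authors:
            coauthors.setdefault(a, set()).update(x for x in authors if x != a)
    numbers = {source: 0}  # Erdos himself has number 0
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for other in coauthors.get(current, ()):
            if other not in numbers:
                # First time we reach this author, so this is the shortest chain.
                numbers[other] = numbers[current] + 1
                queue.append(other)
    return numbers


# Made-up author lists, one list per paper.
papers = [
    ["Erdos", "Alice"],
    ["Alice", "Bob"],
    ["Bob", "Carol", "Dave"],
    ["Eve", "Frank"],  # no chain of co-authors leads to Erdos
]
numbers = erdos_numbers(papers)
print(numbers.get("Alice"), numbers.get("Bob"), numbers.get("Carol"), numbers.get("Eve"))
# 1 2 3 None
```

Authors with no chain of co-authors leading to Erdos simply do not appear in the result, which is why Eve comes back as None.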

So what is my Erdos number? Well, I still have to figure it out. At present I do not want to, but maybe some time later. Check yours, though...

January 29, 2010 at 6:19 am

I am alone, not lonely

The tagline surfaced after reading the article [1] "In Solitary Company".

[1] Vinita Dawara Nangia, “In Solitary Company”, Aug 2009, Available online at http://epaper.timesofindia.com/Default/Scripting/ArticleWin.asp?From=Archive&Source=Page&Skin=TOINEW&BaseHref=CAP/2009/08/09&PageLabel=58&EntityId=Ar05800&ViewMode=HTML&GZ=T

January 24, 2010 at 4:55 am

An Interesting protocol to communicate train delays using missed calls :)

I came across a news article in TOI [1] that talks about how passengers on the Delhi-Palwal and Delhi-Rewari routes have been giving half-hourly updates to notify each other of train delays.

1. http://epaper.timesofindia.com/Default/Scripting/ArticleWin.asp?From=Archive&Source=Page&Skin=TOINEW&BaseHref=CAP/2010/01/24&PageLabel=12&EntityId=Ar01201&ViewMode=HTML&GZ=T

An excerpt from the article, about how the communication works:
A commuter at the station sends a missed call to another passenger travelling on the train, asking her for an update on the train. "If the train is 10 minutes late, the passenger replies with one missed call. If there is a delay of 20 minutes, there are two missed calls. Three missed calls mean the delay is for over 20 minutes.
If the call is disconnected within five seconds even as the passenger on the platform makes the call, it signals that the train has been cancelled. If it gets a 'busy mode' response, the message is that the train is over an hour late. If you receive one missed call and then receive another missed call a few moments later, it is presumed that the train is delayed by over an hour."
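The protocol in the excerpt is really a small lookup table from signal to meaning. Here is a sketch of it in Python; the signal names are my own labels (the article describes the signals only in words), and the table copies the meanings exactly as the excerpt gives them, including the two different signals that both mean "over an hour".

```python
# Hypothetical signal names; the meanings are taken from the excerpt.
MEANINGS = {
    "one_missed_call": "train is 10 minutes late",
    "two_missed_calls": "train is 20 minutes late",
    "three_missed_calls": "train is more than 20 minutes late",
    "disconnected_within_5s": "train has been cancelled",
    "busy": "train is more than an hour late",
    "missed_call_then_another_later": "train is more than an hour late",
}


def decode(signal):
    """Translate an observed signal into the meaning given in the excerpt."""
    try:
        return MEANINGS[signal]
    except KeyError:
        raise ValueError("unknown signal: " + repr(signal))


if __name__ == "__main__":
    for name in MEANINGS:
        print(name, "->", decode(name))
```

Note that "two missed calls" and "one missed call followed by another a few moments later" differ only in timing, so the receiver has to judge the gap between the calls; the excerpt does not say how long that gap is.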

January 24, 2010 at 4:53 am
